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ABSTRACT 

The paper first recalls the Blahut Arimoto algorithm for computing 
the capacity of arbitrary discrete memoryless channels, as an exam- 
ple of an iterative algorithm working with probability density esti- 
mates. Then, a geometrical interpretation of this algorithm based 
on projections onto linear and exponential families of probabilities 
is provided. Finally, this understanding allows also to propose to 
write the Blahut-Arimoto algorithm, as a true proximal point algo- 
rithm, it is shown that the corresponding version has an improved 
convergence rate, compared to the initial algorithm, as well as in 
comparison with other improved versions. 

Index Terms — Iterative algorithm, Blahut-Arimoto algorithm. 
Geometrical interpretation, Convergence speed. Proximal point method. 

1. INTRODUCTION 

In 1972, R. Blahut and S. Arimoto i7]|2y received the Information 
Theory Paper Award for their Transactions Papers on how to com- 
pute numerically the capacity of memoryless channels with finite 
input and output alphabets. 

The Blahut-Arimoto algorithm was recently extended to chan- 
nels with memory and finite input alphabets and state spaces [3 /. 

Recently, an algorithm was proposed for computing the capacity 
of memoryless channels with continuous input and/or output alpha- 
bets where the Blahut-Arimoto algorithm is not directly applied f4lj. 

In [5], information geometric interpretation of the Blahut-Arimoto 
algorithm in terms of alternating information projection was pro- 
vided. Based on this last approach, Matz [6] proposed a modified 
Blahut-Arimoto algorithm that converges significantly faster than 
the standard one. 

The algorithm proposed by Matz is based on an approximation of 
a proximal point algorithm. Instead, we propose a true proximal 
point reformulation that permits to accelerate the convergence speed 
compared to the classical Blahut-Arimoto algorithm and also to the 
approach in [6]. 

Our contributions regarding capacity computation for discrete 
memoryless channels (DMCs) in this paper are: 

• Geometrical interpretation of Blahut-Arimoto algorithm in 
terms of projection onto linear and exponential families of 
probability. 

• True proximal point interpretation. 

• Improvement of the convergence rate based on the proximal 
point formulation. 



2. TOOLS 

2.1. Kullback-Leibler divergence and Mutual Information 

The Kullback-Leibler divergence (KLD) ^ |Sp is defined for two 
probability distributions p = {p{x),x £ X} and q — {q{x), x £ X} 
of a discrete random variable X taking their values x in a discrete set 
Xhy: 

The KLD(also called relative entropy) has some of the properties of 
a metric: D{p\\q) is always non-negative, and is zero if and only if 
p = q. However, it is not a true distance between distributions since 
it is not symmetric {D(p\\q) ^ D{q\\p)) and does not satisfy the 
triangle inequality in general. Nonetheless, it is often useful to think 
of relative entropy as a distance between distributions. 
The channel capacity is given by: 
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C = max /fx, y) 

p(x) 

Where the mutual information of the two discrete random variables 
X and Y is given by : 

l(X.Y) = Ep{D(j,(y\x)\\p(y))} 

2.2. Linear and exponential families of probability 

A linear family of probability is defined as : 
V/i, /2, . . . , /k G X and Vai,a2, • • • , Oic 

£ ^ {p -.Epif.ix)) ^ < i < K} 

The expected value Ep(/i (x)) of the random variable x with respect 
to the distribution p{x) is restricted to ai. A linear family of proba- 
bility is characterized by {fi{x)}i<i<K and {ai}i<i<A'. 
The vector a — [ai, . . . , ajc] serves as a coordinate system in the 
manifold of the linear family. These4 coordinates are called "mix- 
ture coordinates". 

An exponential family [5^] of discrete probability distributions 
p{x) on an alphabet X is the set 
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The exponential family £ is completely defined by ,fi{x) and Q[x) 
and parameterized by 9i . 

The distribution Q{x) is itself an element of the exponential family. 
Any element of £ could play the role of Q(x), but if it is necessary 
to emphasize the dependence of £ on Q{x), we will write £q. 

3. BLAHUT-ARIMOTO-TYPE ALGORITHM 

3.1. The original Blahut-Arimoto algorittim 

Let consider the case of a discrete memoryless channel with input 
symbol X taking its values in the set {a;o, . . . , xa/} and output sym- 
bol Y taking its values in the set {yo, . . . , i/n}- This channel is 
defined by its transition probabilities channel matrix Q as [Q]ij = 
Qiy = Pr{Y = yi\X = Xj). We also define pj = Pr[X = Xj) 
and qi = Pr{Y = yi). 

The mutual information is given by: 

Ai N n '^^ 

I(X.Y) = I(p.Q) = Y.Y.P^Q'y — = I]P^-D(Q,||?) 

j = i = 3 = 

And the channel capacity by: 

C = max/(p, Q) 

V 

By solving this maximization problem and taking into consider- 
ation the normalization condition: p{x) = 1, we find: 

_ c^p{D{P{Y\X=Xj)\\P(Y))) 

Hence, the Classical Blahut-Arimoto algorithm (Tl|2l is an iterative 
procedure: 



Jk + l) 



(X) 



pW(:c)exp(7?J) 



with = D{p(Y = y\X = x)MY = y^"^)). 



(1) 



3.2. Geometrical Interpretation of Blahut-Arimoto Algorithm 

The Blahut-Arimoto algorithm in (TJ can be recalculated as a mini- 
mization problem: 

r minp D{p{x)\\p^''\x)) 
< s.c I^''\p{x)) = a 

where I^''\p{x)) = E,p{D{p(y\x)\\p'' (y))} is the current capacity 
estimate at the iteration k and a is related to the Lagrangian multi- 
plier of this minimization problem. 

The Lagrangian corresponding to this minimization problem can be 
written as follow: 

£ = D{p{xW\x)) - Ai(/«(p(x)) - q) - A2(E.pW - 1) 

ally = ^ log(p(x)) + 1 - log(p(*n^)) - AiO^ - A2 = and 
p{x) = p('=)(a;) exp(A2 - 1) exp(AiD^) 

Taking into consideration the normalization constraint, we can easily 
obtain that exp(A2 - 1) = — rrrrT — r,k, and p<'=+^)(a;) = 

p('=)(x)exp(Aig^) 
J^^p('!)(i)exp(AiDj) 

In the following, we will see that this parameter Ai is a step size pa- 
rameter which, for convenient values, can accelerate the convergence 



speed of the classical Blahut-Arimoto algorithm in which Ai = 1. 
So the Blahut-Arimoto Algorithm can be interpreted as the projec- 
tion of p^''\x) onto a linear family of probability £ at the point 
p'^'+i' (a;) where £ is defined by ^ ^ D{p{y/x)\\p'^^\y)) 
and a\ such as Ep(D*) = aj. 

By choosing increasing , we would ensure that the mutual in- 
formation increases from one iteration to the other ij>{x)) > 
7'*^' {j){x))). However, this quantity is only implicitly defined in the 
algorithm and an appropriate choice is not available.In the following, 
we show that this problem will be solved based on a proximal point 
interpretation that ensures that the mutual information increases dur- 
ing iterations. 

Note that this linear family of probability is changing from one 
iteration to the other. 

On the other hand, the Blahut-Arimoto algorithm can be interpreted 
as the projection of a probability density function (pdf) onto an expo- 
nential family of probability £ defined by Q{x) — p'''' {x), f[''^ (x) = 

and parametrized with S^''' at the point p'*^^'(a;). 

To do this, we should solve this problem: 

minD(_R(a;)||p(a;,6l)) 

0\ _ Q(x)exp{efi(x) 
Py-^^^) - Y.^ Q[x)exp(0fi(x)) 

where R{x) is a certain pdf. We try now to find some interest- 
ing characteristics of R{x). To do this, let solve the minimization 
problem given above. — with logp{x,6) — 

logO(x) + 9Mx) - log(E, Q{x) eMefiix))) 



£xfi(^)Sx Q(x)fi(x)e^p(eh{x)) 







SoJ2xRi^)hi^) - ^^Q(x)e^p(eh(x)) 

Hence E. Rix).h (x) - ^ g^'g^l^/ri^'lt" Ri^) = lead- 
ing to Ex W^) - Pi^^ S))Mx) = having that E, R{x) = 1 
andp(2;,6i) - Q(^) ''''p(9/i(^)) 
We obtain 



^^Q{x)c^p(efi{x))- 



J:JR{x)-p'-''+'\x))D'^^0 
Which can be reformulated as 



(fc + 1) 



Hence the Blahut-Arimoto algorithm can be interpreted as the pro- 
jection of pdfs R{x) with higher mutual information than I{p^'°^ (x)) 
onto an exponential family £ defined by Q{x) = p''^-' (x), /[''^ (x) = 
and parameterized by 9[''^ = 1 /Afe at the point p'^^^' (x). Note 
that this exponential family is also changing from iteration to another 
since Q(x) and f[''\x) depends on the iteration. Here again, an ap- 
propriate choice of the parameter for increasing convergence rate is 
difficult, because of the implicit definition of the family. Thus, a 
proximal point interpretation maximizing explicitly the mutual in- 
formation is considered with a given penalty term. 

3.3. Proximal point interpretation of B.A. and amelioration in 
terms of convergence speed 

Following the results above, and based on a proximal point inter- 
pretation, we can solve the problem stated by the implicit definition 
of the families. In fact, we propose a clear equivalence with a true 
proximal point interpretation, in which all constants are explicitly 
defined, thus allowing to propose convergence rate improvement. It 
is easily shown that the Blahut-Arimoto algorithm is equivalent to 

p'-''+^\x) = argmax{j('=^(p(a;)) - D(p{x)\\p'^''\x))} (2) 



In fact, by deriving this expression over p{x) and set it equal to zero, 
we find exactly the iterative expression of the Blahut-Arimoto algo- 
rithm. 

But till now we cannot say that the Blahut-Arimoto algorithm 
can be interpreted as a proximal point method since the cost func- 
tion I^''^ {pix)) depends on the iterations, just like the families were 
depending on the iterations. In fact, a true proximal point algorithm 
can be written for a maximization problem [9 J as follow : 



□ (fc+i) 



argmax{e(6l)-/3fe||6»-6» 



(fe)||2 



} 



(3) 



in which £,{9), the cost function to be maximized, is independent 
from the iterations, \\9 — 6^''^ \\'^ is a penalty term which ensures that 
the update remains in the vicinity of S'''' and Pk is a sequence 

of positive parameters. In 1101 , Rockafellar showed that superlinear 
convergence of this method is obtained when the sequence Pk con- 
verges towards zero. 

The definition of the proximal point algorithm in ([3} can be general- 
ized to a wide range of penalty terms leading to this general formu- 
lation: 

where f{e, 6^''^) is always non negative and /(6l'''\ 6'''"'') = 0. 
The mutual information I{p{x)) can be expressed as: 



/(p(x)) = /W(p(x))-D(q(y)||qW(y)) 



(4) 



Introducing Q in l[2]l leads to 



P 



{X) 



,{/(p(x)) - {D{p{xW\x)) - D{q{y)U''\ym 



arg max 



This new formulation establishes a clear link with the definition of 
the capacity based on the mutual information. However, for a true 
proximal pint formulation, we need to show that: 

D{p{x)\\p<^''\x))-D{q{y)W\y))>Q 

with equality iff p{x) — p'''^\x) and q{y) — q''''\y) in order to 

prove that the Blahut-Arimoto is a proximal point algorithm. 

The penalty term D{p{x) \ {x))-D{q{y) | jq^*' {y)) can be rewrit- 



ten as E. 



'P(^,B)[108 pi''){x)-ZiVi.y\i)p(i) 

)rdii 

{p{a:,i/)[-log 



We can also write according to Jensen's inequality ^ : 

P'^'^ {x)Y,^p{y\x)p(x) 



p(a;)EiP(j/l*y*'(2) 

> -log(^^p(s,t/)) = 



(5) 
(6) 



This proves that the Blahut-Arimoto algorithm can be interpreted 
as a true proximal point method where the cost function is the true 
mutual information and the penalty term reads 

D{p(x)U''\x)) - D{q{y)\W^\y)) 

The corresponding proximal point algorithm reads: 

= argmaXp(,){/(p(x))-Afe(7?(p(a;)|lp('^)(x))) 
~D{q{y)U^\y))}\ 

(7) 

where is the step size introduced in order to accelerate the con- 
vergence rate of the classical Blahut-Arimoto algorithm. 
By deriving this function 



l{p{x)) - \k{D{p{xW^\x)) - D{q{y)\W^\y))) 
and set it equal to zero we find: 

= pW(x)exp{E,p(yWlog^- 



+ii:E,p(yWiog^} 



Here, it is important to note that we can obtain the classical case by 
simply replacing by 1. 

Moreover, we can also obtain the approach proposed by Matz |6| 
by intuitively replacing the probability distribution q(\j] in the right 
hand of the equation by the same distribution calculated at the pre- 
vious iteration Namely: 

p('=+i)(x) = pW(a,)exp{E,p(yl^)logf|^- 

After normalization, we getp'*"*"^' [x) = p**' (a;) exp(D^: /A^) which 
is the expression of Matz's approach. This is globally similar to the 
One-Step-Late algorithm suggested by Green/77 / 

We conclude that Matz's approach is based on an approximation of 
the proximal point method, but what is lost in comparison with the 
true proximal point method is the guarantee that the method con- 
verges, since convergence conditions must be reviewed again. 
We can write according to 0: 

J(p(''+i)(x)) - 
A,(i3(p(^+i)(x)|b«(a;)) - D{q^^^^\y)U^\y))) > 
/(pW(x)) - MDip'^''\xW\x)) - D(gW(j/)||g('''(j/))) 

Hence 

I{p'-''+^\x)) > 

/(p«(x)) + A,(i3(p(*+^'(:c)|ip«(a;)) - Diq<^'^+'\y)\\q<>''> (y))) 

To ensure the increasing of the mutual information during iterations, 
we must have: 

/(p<'=+i'(s)) > /(p<'='(a;)) 

SothatAfc(0(p('=+i'(a;)lb''n^))--D(g('=+^'(y)lk'''(y))) > 
which is true, from {5} for every Afc > which is not true in the 
approach proposed by Matz. In our method, we choose Afc such 
that: 

max,, XkW^+'\xWHx)) - D{q^'^+^\yWHy))) 

in which p^''^^' (a;) and q'-''^^\y) depend on Afc. 

This ensures that the difference between /(p''''^^' (x)) and lip'"'''' (x)) 

is as maximum as possible from one iteration to the other one. Note 

that this maximization problem is solved by the conjuguate gradient 

method which gives the most convenient value of the step size Afc 

comparing to the approach proposed by Matz. 

Note that, in terms of algorithmic complexity, the updated value of 

Afc in each iteration requires: 

(N-l-M-l-1) divisions and (N-l-M) multiplications in Matz's approach. 
(2N-I-M-I-1) divisions, (2N-I-M-I-2) multiplications and 2 additions in 
our case based on the proximal point method. 
Hence, our method requires less than twice operations per iteration 
compared to the approach proposed by Matz, however, it converges 
faster (as we can see in the simulation results showed below, the it- 
eration number is divided by two in the worst case). A compromise 
must be established depending on our interests. 



4. SIMULATION RESULTS 

First, we test the 3 versions of the Blahut-Arimoto iterative algorithm 
on a Discrete Binary Symmetric Channel (DBSC) defined by the 
transition matrix : 

f 0.7 0.2 0.1 1 
^ - \ 0.1 0.2 0.7 J 

The results (figlDl show that the channel capacity is achieved after 20 
iterations in the classical case, 7 iterations in Matz's approach and 4 
iterations in our case (with a precision of 10~^^). 
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Fig. 3. Comparision between the 2 approaches in the case of a Gaussian 
Bernouilli-Gaussian channel 
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Fig. 2. Comparision between the 3 approaches in the case of a DBSC chan- 
nel 

A second example intends to characterize better the efficiency of 
our method in comparison with the one by Matz. In order to do so we 
need a higher dimension problem. We have chosen the discretization 
of some continuous Gaussian Bernouilli-Gaussian channel in order 
to form a transition channel matrix Q with higher dimensions. Such 
a channel is defined as follows : 

yfc = a;fc + bfc + 7i; 

where 

• fo~A/'(0,crf) 

• 7fc = e.k9k with e : Bernouilli(p) sequence 

• 5 - A/'(0, cTg) with gI < u'l 
Hence 

Vk = Xk + rik 

with 

p{nk) = {1 ^ pW{0,a!) + p^{0,a! + a^) 

The output Uk has been discretized on 40 values, and the input Xk 
on 10 values. The results plotted on (fig|3j for parameters (p = 
0.3,(76 — 0.01, CTg = 1) show the acceleration of the Blahut-Arimoto 
algorithm from 14 iterations in Matz's approach to 7 iterations in our 
method. 
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5. CONCLUSIONS 

We have proposed geometrical interpretations and improvements on 
the Blahut-Arimoto (BA) algorithm for computing the capacity of 
discrete memoryless channels (DMC). Based on the true proximal 
point approach and solving the maximization problem with the con- 
jugate gradient method, we have accelerated the convergence rate of 
this iterative algorithm compared to the aproach proposed by Matz 
which is based on an approximation of the proximal point method. 
We are currently investigating the use of similar techniques for im- 
proving the convergence rate of other iterative algorithms. 



